feat: add parallel distance computation and vectorized pipeline #43
Pull Request Test Coverage Report for Build 142
💛 - Coveralls
On IBM Power8:
(venv-pynomaly) vconstan@SNA-MINSKY-N03:~/projects/PyNomaly$ python examples/numba_speed_diff.py
/home/vconstan/projects/PyNomaly/PyNomaly/loop.py:518: NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function _compute_distance_and_neighbor_matrix failed at nopython mode lowering due to: scipy 0.16+ is required for linear algebra
File "PyNomaly/loop.py", line 537:
def _compute_distance_and_neighbor_matrix(
<source elided>
diff = clust_points_vector[p[0]] - clust_points_vector[p[1]]
d = np.dot(diff, diff) ** 0.5
^
During: lowering "$88call_method.23 = call $82load_method.20(diff, diff, func=$82load_method.20, args=[Var(diff, loop.py:536), Var(diff, loop.py:536)], kws=(), vararg=None)" at /home/vconstan/projects/PyNomaly/PyNomaly/loop.py (537)
@staticmethod
/home/vconstan/.conda/envs/venv-pynomaly/lib/python3.8/site-packages/numba/core/object_mode_passes.py:177: NumbaWarning: Function "_compute_distance_and_neighbor_matrix" was compiled in object mode without forceobj=True.
File "PyNomaly/loop.py", line 519:
@staticmethod
def _compute_distance_and_neighbor_matrix(
^
warnings.warn(errors.NumbaWarning(warn_msg,
/home/vconstan/.conda/envs/venv-pynomaly/lib/python3.8/site-packages/numba/core/object_mode_passes.py:187: NumbaDeprecationWarning:
Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.
For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit
File "PyNomaly/loop.py", line 519:
@staticmethod
def _compute_distance_and_neighbor_matrix(
^
warnings.warn(errors.NumbaDeprecationWarning(msg,
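The warning above is raised because Numba's nopython-mode support for `np.dot` depends on SciPy being installed. One possible workaround, sketched here in plain NumPy, is to compute the same Euclidean distance with elementwise operations, which should not need SciPy when compiled. The function names below are hypothetical and this is not necessarily the fix adopted in PyNomaly.

```python
import numpy as np


def euclidean_dot(a, b):
    """Distance as written in the traceback: relies on np.dot."""
    diff = a - b
    return np.dot(diff, diff) ** 0.5


def euclidean_elementwise(a, b):
    """Same distance using only elementwise multiply, sum, and sqrt."""
    diff = a - b
    return np.sqrt(np.sum(diff * diff))


# Hypothetical sample points; the difference is (-3, -4, 0).
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

dot_result = euclidean_dot(a, b)
elementwise_result = euclidean_elementwise(a, b)
```

Both functions return the same value for the sample points, so the elementwise form could be swapped in without changing results.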
Because there is a trade-off between the number of cores used for parallel computation and the communication overhead between parallel threads, it would be useful to let users set the number of concurrent threads. Numba exposes this through an environment variable, so it may be worth adding an optional parameter for it when running distance calculations in parallel: https://numba.pydata.org/numba-doc/latest/user/threading-layer.html#setting-the-number-of-threads
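As a rough sketch of what such an optional parameter could look like, the helper below sets the `NUMBA_NUM_THREADS` environment variable described in the linked Numba documentation. The helper name `configure_numba_threads` and its clamping behavior are assumptions made for illustration, not part of PyNomaly.

```python
import os


def configure_numba_threads(n_threads=None):
    """Set NUMBA_NUM_THREADS; call this before Numba is first imported.

    Hypothetical helper: n_threads=None means "use every available core".
    """
    available = os.cpu_count() or 1
    if n_threads is None:
        n_threads = available
    if not isinstance(n_threads, int) or n_threads < 1:
        raise ValueError(f"n_threads must be a positive integer, got {n_threads!r}")
    # Never request more threads than the machine reports.
    n_threads = min(n_threads, available)
    os.environ["NUMBA_NUM_THREADS"] = str(n_threads)
    return n_threads


# Example: restrict Numba to a single thread.
chosen = configure_numba_threads(1)
```

Once Numba is already imported, `numba.set_num_threads(n)` can lower the thread count at runtime instead, as long as `n` does not exceed the value Numba started with.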
Added a More investigation is needed to see whether the above behavior is machine-specific or code-related, but we can now parallelize distinct portions of the code and set the number of threads when using Numba.
Results from another machine:
Results from another run.
Results from another machine (4 core CPU, running from WSL):
Refactored how processing is handled so that there is a speed improvement when using Numba and increasing the number of cores. Once the issue below is handled, I'll report back with computation speed numbers. Multi-core processing required changes to the progress bar, which is still a work in progress. One key challenge is flushing stdout in a way that is compatible with Numba: print statements are supported in Numba-compiled functions, but sys.stdout.flush() does not appear to be.
Placing this issue on hold while other repository issues are resolved; this is low priority and can be resolved at a later time.
Updated `readme.md` with the total and monthly number of package downloads.
chore: remove Python 3.6 and 3.7 support
chore: update readme.md with another core library example
feat: refactor Validation class for ease of use
Rewrite the distance computation engine from scratch on top of v0.3.5:

- Vectorized kNN distances using NumPy broadcasting with chunked processing for memory efficiency and progress bar support
- Add n_jobs parameter for cross-cluster multiprocessing via concurrent.futures (n_jobs=-1 uses all cores)
- Restructure Numba path with non-generator kernels that support numba.prange for thread-level parallelism
- Optional scipy.spatial.distance.cdist and scipy.special.erf acceleration when scipy is available
- Vectorize _standard_distances, _prob_distances, and _norm_prob_outlier_factor pipeline methods
- Fully backward-compatible: all existing API calls work unchanged

Closes #36

Made-with: Cursor
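To illustrate the chunked broadcasting idea named in the commit message, here is a minimal sketch. The function name `knn_distances_chunked`, its signature, and the default chunk size are assumptions made for this example and do not reflect PyNomaly's internal code.

```python
import numpy as np


def knn_distances_chunked(points, k, chunk_size=256):
    """Return (distances, indices) of the k nearest neighbors of each point.

    Rows are processed in chunks so the temporary difference array holds
    at most chunk_size * n * d values.
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    distances = np.empty((n, k))
    indices = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Broadcasting: (chunk, 1, d) - (1, n, d) -> (chunk, n, d)
        diff = points[start:stop, None, :] - points[None, :, :]
        dist = np.sqrt(np.sum(diff * diff, axis=2))
        # Exclude each point from its own neighbor list.
        dist[np.arange(stop - start), np.arange(start, stop)] = np.inf
        nearest = np.argsort(dist, axis=1)[:, :k]
        indices[start:stop] = nearest
        distances[start:stop] = np.take_along_axis(dist, nearest, axis=1)
    return distances, indices


# Hypothetical sample data; chunk_size=2 forces two chunks for three points.
sample = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]]
knn_dist, knn_idx = knn_distances_chunked(sample, k=1, chunk_size=2)
```

Each loop iteration is also a natural place to advance a progress bar, which is one reason chunking and progress reporting pair well.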
5632d31 to 5e93be2
Update version across loop.py, setup.py, and README badge. Add changelog entry documenting all new features and improvements. Made-with: Cursor


Summary
This PR addresses #36 by implementing parallelization and vectorization of PyNomaly's distance computation and pipeline, rebased onto the current v0.3.5 codebase.
Changes
- Vectorized kNN distance computation using NumPy broadcasting (optionally `scipy.spatial.distance.cdist`), yielding significant speedups without new required dependencies
- `n_jobs` parameter: Added cross-cluster multiprocessing via `concurrent.futures.ProcessPoolExecutor`. Set `n_jobs=-1` to use all CPU cores. Follows the scikit-learn convention
- Restructured the Numba path with non-generator kernels that support `numba.prange` for proper thread-level parallelism (the previous generator-based approach was incompatible with Numba's parallel mode)
- Uses `scipy.spatial.distance.cdist` for distance computation and `scipy.special.erf` for the error function when scipy is available, with graceful fallback to pure NumPy
- Replaced `for` loops in `_standard_distances`, `_prob_distances`, and `_norm_prob_outlier_factor` with vectorized NumPy operations
API
Fully backward-compatible. The only addition is the optional `n_jobs` parameter (default `1`). All existing function calls, examples, and usage patterns continue to work unchanged.
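The `n_jobs` behavior described above could be sketched roughly as follows. The helper names (`resolve_n_jobs`, `map_clusters`, `cluster_size`) and the exact validation rules are assumptions for illustration only; the PR's actual implementation may differ.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def resolve_n_jobs(n_jobs):
    """Translate n_jobs into a worker count; -1 means all CPU cores."""
    if n_jobs == -1:
        return os.cpu_count() or 1
    if isinstance(n_jobs, int) and n_jobs >= 1:
        return n_jobs
    raise ValueError(f"n_jobs must be a positive integer or -1, got {n_jobs!r}")


def cluster_size(cluster):
    """Placeholder for the real per-cluster computation."""
    return len(cluster)


def map_clusters(func, clusters, n_jobs=1):
    """Apply func to each cluster, using processes only when n_jobs > 1."""
    workers = min(resolve_n_jobs(n_jobs), max(len(clusters), 1))
    if workers == 1:
        # Default path: no process pool, same behavior as before n_jobs existed.
        return [func(cluster) for cluster in clusters]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, clusters))


# Serial call with hypothetical clusters of sizes 2 and 1.
serial_sizes = map_clusters(cluster_size, [[1, 2], [3]], n_jobs=1)
```

Keeping the serial path as the default is what makes the addition backward-compatible: callers that never pass `n_jobs` do not start a process pool. When `n_jobs > 1`, the function passed to the pool must be defined at module level so it can be pickled.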
Testing
`test_n_jobs_equivalence`, `test_n_jobs_single_cluster`, `test_n_jobs_invalid`
Closes #36